{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 10.6 求近义词和类比词\n",
    "## 10.6.1 使用预训练的词向量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.1.0\n"
     ]
    }
   ],
   "source": [
    "import collections\n",
    "import math\n",
    "import random\n",
    "import sys\n",
    "import zipfile\n",
    "import time\n",
    "import os\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "from collections import defaultdict\n",
    "\n",
    "sys.path.append(\"..\") \n",
    "import d2lzh_tensorflow2 as d2l\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tensorflow并没有像pytorch或者mxnet那样的glove或fasttext那样的运行库，故我们必须要去Glove的官方网站下载对应的词向量文件到文件夹中，并进行解压。\n",
    "预训练的GloVe模型的命名规范大致是“模型.（数据集.）数据集词数.词向量维度”。\n",
    "更多信息可以参考GloVe和fastText的项目网站[1,2]。下面我们使用基于维基百科子集预训练的50维GloVe词向量,下载地址:http://nlp.stanford.edu/data/glove.6B.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "class tensorflow_glove:\n",
    "    def __init__(self, glove_filename):\n",
    "        self.word_to_index = dict()\n",
    "        self.index_to_embedding = []\n",
    "        self.index_to_word = dict()\n",
    "        with open(glove_filename, 'r', encoding='utf-8') as glove_file:\n",
    "            for (i, line) in enumerate(glove_file):\n",
    "\n",
    "                split = line.split(' ')\n",
    "\n",
    "                word = split[0]\n",
    "\n",
    "                representation = split[1:]\n",
    "                representation = np.array(\n",
    "                    [float(val) for val in representation])\n",
    "                self.word_to_index[word] = i\n",
    "                self.index_to_word[i] = word\n",
    "                self.index_to_embedding.append(representation)\n",
    "        _WORD_NOT_FOUND = [0.0] * len(representation)\n",
    "        _LAST_INDEX = i + 1\n",
    "        self.word_to_index = defaultdict(lambda: _LAST_INDEX,\n",
    "                                         self.word_to_index)\n",
    "        self.index_to_embedding = np.array(self.index_to_embedding +\n",
    "                                           [_WORD_NOT_FOUND])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "glove = tensorflow_glove(\"C:/Users/HP/dive into d2l/code/chapter10_natural-language-processing/embeddings/ GloVe.6B/glove.6B.50d.txt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "一共包含400001个词。\n"
     ]
    }
   ],
   "source": [
    "print(\"一共包含%d个词。\" % len(glove.index_to_embedding))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "word_to_index为单词和序号的字典，index_to_embedding为相对应的词嵌入向量。index_to_word为下标到单词的字典"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "word_to_index, index_to_embedding, index_to_word = glove.word_to_index,glove.index_to_embedding,glove.index_to_word"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "打印词典大小。其中含有40万个词。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以通过词来获取它在词典中的索引,并获取该词的词嵌入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3366, 'beautiful')"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_to_index['beautiful'], index_to_word[3366]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(400001, 50)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab_size, embedding_dim = index_to_embedding.shape\n",
    "vocab_size,embedding_dim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10.6.2 应用预训练词向量\n",
    "\n",
    "下面我们以GloVe模型为例，展示预训练词向量的应用。\n",
    "\n",
    "### 10.6.2.1 求近义词\n",
    "\n",
    "这里重新实现10.3节（word2vec的实现）中介绍过的使用余弦相似度来搜索近义词的算法。为了在求类比词时重用其中的求$k$近邻（$k$-nearest neighbors）的逻辑，我们将这部分逻辑单独封装在`knn`函数中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def knn(W, x, k):\n",
    "    # 添加的1e-9是为了数值稳定性\n",
    "    cos = tf.reshape(tf.matmul(W, x),shape=[-1])/ tf.sqrt(tf.reduce_sum(W * W, axis=1) * tf.reduce_sum(x * x) + 1e-9)\n",
    "    _, topk = tf.math.top_k(cos, k=k+1)\n",
    "    topk=topk.numpy().tolist()\n",
    "    return topk, [cos[i] for i in topk]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_similar_tokens(query_token, k, embed):\n",
    "    index = word_to_index[query_token]\n",
    "    topk, cos = knn(embed.index_to_embedding,\n",
    "                    tf.reshape(embed.index_to_embedding[index],(50,1)), k)\n",
    "    for i, c in zip(topk[1:], cos[1:]):  # 除去输入词\n",
    "        print('cosine sim=%.3f: %s' % (c, (index_to_word[i])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "已创建的预训练词向量实例`glove_6b50d`的词典中含40万个词和1个特殊的未知词。除去输入词和未知词，我们从中搜索与“chip”语义最相近的3个词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cosine sim=0.856: chips\n",
      "cosine sim=0.749: intel\n",
      "cosine sim=0.749: electronics\n"
     ]
    }
   ],
   "source": [
    "get_similar_tokens('chip', 3, glove)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来查找“baby”和“beautiful”的近义词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cosine sim=0.839: babies\n",
      "cosine sim=0.800: boy\n",
      "cosine sim=0.792: girl\n"
     ]
    }
   ],
   "source": [
    "get_similar_tokens('baby', 3, glove)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cosine sim=0.921: lovely\n",
      "cosine sim=0.893: gorgeous\n",
      "cosine sim=0.830: wonderful\n"
     ]
    }
   ],
   "source": [
    "get_similar_tokens('beautiful', 3,  glove)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10.6.2.2 求类比词\n",
    "\n",
    "除了求近义词以外，我们还可以使用预训练词向量求词与词之间的类比关系。例如，“man”（男人）: “woman”（女人）:: “son”（儿子） : “daughter”（女儿）是一个类比例子：“man”之于“woman”相当于“son”之于“daughter”。求类比词问题可以定义为：对于类比关系中的4个词 $a : b :: c : d$，给定前3个词$a$、$b$和$c$，求$d$。设词$w$的词向量为$\\text{vec}(w)$。求类比词的思路是，搜索与$\\text{vec}(c)+\\text{vec}(b)-\\text{vec}(a)$的结果向量最相似的词向量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_analogy(token_a, token_b, token_c, embed):\n",
    "    vecs = [embed.index_to_embedding[glove.word_to_index[t]] \n",
    "                for t in [token_a, token_b, token_c]]\n",
    "    x = vecs[1] - vecs[0] + vecs[2]\n",
    "    topk, cos = knn(embed.index_to_embedding, tf.reshape(x,(50,1)), 1)\n",
    "    return glove.index_to_word[topk[0]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "验证一下“男-女”类比。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'daughter'"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_analogy('man', 'woman', 'son',glove)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "“首都-国家”类比：“beijing”（北京）之于“china”（中国）相当于“tokyo”（东京）之于什么？答案应该是“japan”（日本）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'japan'"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_analogy('beijing', 'china', 'tokyo', glove)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "“形容词-形容词最高级”类比：“bad”（坏的）之于“worst”（最坏的）相当于“big”（大的）之于什么？答案应该是“biggest”（最大的）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'biggest'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_analogy('bad', 'worst', 'big', glove)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "“动词一般时-动词过去时”类比：“do”（做）之于“did”（做过）相当于“go”（去）之于什么？答案应该是“went”（去过）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'went'"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_analogy('do', 'did', 'go', glove)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 小结\n",
    "\n",
    "* 在大规模语料上预训练的词向量常常可以应用于下游自然语言处理任务中。\n",
    "* 可以应用预训练的词向量求近义词和类比词。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 参考文献\n",
    "\n",
    "\n",
    "[1] GloVe项目网站。 https://nlp.stanford.edu/projects/glove/\n",
    "\n",
    "[2] fastText项目网站。 https://fasttext.cc/\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
